Programmable Synthetic Tabular Data Generation
Large amounts of tabular data remain underutilized due to privacy, data
quality, and data sharing limitations. While training a generative model
producing synthetic data resembling the original distribution addresses some of
these issues, most applications require additional constraints from the
generated data. Existing synthetic data approaches are limited as they
typically only handle specific constraints, e.g., differential privacy (DP) or
increased fairness, and lack an accessible interface for declaring general
specifications. In this work, we introduce ProgSyn, the first programmable
synthetic tabular data generation algorithm that allows for comprehensive
customization over the generated data. To ensure high data quality while
adhering to custom specifications, ProgSyn pre-trains a generative model on the
original dataset and fine-tunes it on a differentiable loss automatically
derived from the provided specifications. These can be programmatically
declared using statistical and logical expressions, supporting a wide range of
requirements (e.g., DP or fairness, among others). We conduct an extensive
experimental evaluation of ProgSyn on a number of constraints, achieving a new
state-of-the-art on some, while remaining general. For instance, at the same
fairness level we achieve 2.3% higher downstream accuracy than the
state-of-the-art in fair synthetic data generation on the Adult dataset.
Overall, ProgSyn provides a versatile and accessible framework for generating
constrained synthetic tabular data, allowing for specifications that generalize
beyond the capabilities of prior work.
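The fine-tuning step can be pictured with a short sketch. This is a hypothetical illustration, not the ProgSyn interface: the generator, the fidelity_loss term, the demographic-parity penalty, and the weight lam are all assumed names standing in for whatever the declared specification induces.

# Hypothetical sketch of fine-tuning a pre-trained tabular generator on a
# differentiable penalty derived from a declared specification (here a
# demographic-parity-style fairness constraint). Not the ProgSyn API.
import torch

def dp_gap(samples, sens_col, label_col):
    # Soft demographic-parity gap between the two groups of a binary
    # sensitive feature, computed on relaxed (soft) samples.
    s = samples[:, sens_col]
    y = samples[:, label_col]
    rate_1 = (y * s).sum() / (s.sum() + 1e-8)
    rate_0 = (y * (1 - s)).sum() / ((1 - s).sum() + 1e-8)
    return (rate_1 - rate_0).abs()

def finetune(generator, fidelity_loss, optimizer, latent_dim,
             sens_col=0, label_col=-1, lam=10.0, steps=1000, batch=512):
    for _ in range(steps):
        z = torch.randn(batch, latent_dim)
        samples = generator(z)                     # differentiable samples
        loss = fidelity_loss(samples) + lam * dp_gap(samples, sens_col, label_col)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()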
FARE: Provably Fair Representation Learning with Practical Certificates
Fair representation learning (FRL) is a popular class of methods aiming to
produce fair classifiers via data preprocessing. Recent regulatory directives
stress the need for FRL methods that provide practical certificates, i.e.,
provable upper bounds on the unfairness of any downstream classifier trained on
preprocessed data, which directly provides assurance in a practical scenario.
Creating such FRL methods is an important challenge that remains unsolved. In
this work, we address that challenge and introduce FARE (Fairness with
Restricted Encoders), the first FRL method with practical fairness
certificates. FARE is based on our key insight that restricting the
representation space of the encoder enables the derivation of practical
guarantees, while still permitting favorable accuracy-fairness tradeoffs for
suitable instantiations, such as one we propose based on fair trees. To produce
a practical certificate, we develop and apply a statistical procedure that
computes a finite sample high-confidence upper bound on the unfairness of any
downstream classifier trained on FARE embeddings. In our comprehensive
experimental evaluation, we demonstrate that FARE produces practical
certificates that are tight and often even comparable with purely empirical
results obtained by prior methods, which establishes the practical value of our
approach.
Comment: ICML 202
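The certification idea can be illustrated with a small sketch: once the encoder maps every input to one of finitely many cells, the demographic-parity gap of any downstream classifier is bounded by the total-variation distance between the two groups' cell distributions, which admits a finite-sample high-confidence upper bound. The snippet below is only an illustrative stand-in using a crude per-cell Hoeffding bound with a union bound; it is not FARE's actual statistical procedure.

# Illustrative sketch: upper-bound the demographic-parity gap of ANY classifier
# trained on a k-cell embedding, holding with probability >= 1 - delta.
# Assumes cell indices lie in {0, ..., k-1} and a binary sensitive attribute.
import numpy as np

def unfairness_upper_bound(cells, sensitive, k, delta=0.05):
    cells, sensitive = np.asarray(cells), np.asarray(sensitive)
    n0, n1 = (sensitive == 0).sum(), (sensitive == 1).sum()
    p0 = np.bincount(cells[sensitive == 0], minlength=k) / n0
    p1 = np.bincount(cells[sensitive == 1], minlength=k) / n1
    emp_tv = 0.5 * np.abs(p0 - p1).sum()               # empirical total variation
    eps0 = np.sqrt(np.log(4 * k / delta) / (2 * n0))   # per-cell deviation, group 0
    eps1 = np.sqrt(np.log(4 * k / delta) / (2 * n1))   # per-cell deviation, group 1
    return min(1.0, emp_tv + 0.5 * k * (eps0 + eps1))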
Certified Defenses: Why Tighter Relaxations May Hurt Training
Certified defenses based on convex relaxations are an established technique
for training provably robust models. The key component is the choice of
relaxation, varying from simple intervals to tight polyhedra. Paradoxically,
however, training with tighter relaxations can often lead to worse certified
robustness. The poor understanding of this paradox has forced recent
state-of-the-art certified defenses to focus on designing various heuristics in
order to mitigate its effects. In contrast, in this paper we study the
underlying causes and show that tightness alone may not be the determining
factor. Concretely, we identify two key properties of relaxations that impact
training dynamics: continuity and sensitivity. Our extensive experimental
evaluation demonstrates that these two factors, observed alongside tightness,
explain the drop in certified robustness for popular relaxations. Further, we
investigate the possibility of designing and training with relaxations that are
tight, continuous and not sensitive. We believe the insights of this work can
help drive the principled discovery of new and effective certified defense
mechanisms.
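As a concrete reference point, the loosest relaxation in this spectrum, interval bound propagation, can be written in a few lines; tighter polyhedral relaxations replace these boxes with more precise sets. The snippet is a generic illustration, not code from the paper.

# Interval bound propagation (IBP) through an affine layer and a ReLU:
# the simplest convex relaxation used in certified training.
import numpy as np

def affine_bounds(lo, hi, W, b):
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    c = W @ center + b
    r = np.abs(W) @ radius        # worst-case growth of the box
    return c - r, c + r

def relu_bounds(lo, hi):
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)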
Data Leakage in Tabular Federated Learning
While federated learning (FL) promises to preserve privacy in distributed
training of deep learning models, recent work in the image and NLP domains
showed that training updates leak private data of participating clients. At the
same time, most high-stakes applications of FL (e.g., legal and financial) use
tabular data. Compared to the NLP and image domains, reconstruction of tabular
data poses several unique challenges: (i) categorical features introduce a
significantly more difficult mixed discrete-continuous optimization problem,
(ii) the mix of categorical and continuous features causes high variance in the
final reconstructions, and (iii) structured data makes it difficult for the
adversary to judge reconstruction quality. In this work, we tackle these
challenges and propose the first comprehensive reconstruction attack on tabular
data, called TabLeak. TabLeak is based on three key ingredients: (i) a softmax
structural prior, implicitly converting the mixed discrete-continuous
optimization problem into an easier fully continuous one, (ii) a way to reduce
the variance of our reconstructions through a pooled ensembling scheme
exploiting the structure of tabular data, and (iii) an entropy measure which
can successfully assess reconstruction quality. Our experimental evaluation
demonstrates the effectiveness of TabLeak, reaching a state-of-the-art on four
popular tabular datasets. For instance, on the Adult dataset, we improve attack
accuracy by 10% compared to the baseline on the practically relevant batch size
of 32 and further obtain non-trivial reconstructions for batch sizes as large
as 128. Our findings are important as they show that performing FL on tabular
data, which often poses high privacy risks, is highly vulnerable.
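The first ingredient, the softmax structural prior, can be sketched as follows; the function name and the way features are assembled are hypothetical, not the TabLeak code.

# Hypothetical sketch of the softmax structural prior: each categorical feature
# is parameterized by free logits whose softmax yields a soft one-hot block, so
# the whole candidate row is continuous and can be optimized by gradient
# descent against the observed client update.
import torch

def assemble_row(cont_params, cat_logits_list):
    parts = [cont_params]                              # continuous features
    for logits in cat_logits_list:
        parts.append(torch.softmax(logits, dim=-1))    # relaxed one-hot block
    return torch.cat(parts, dim=-1)

# After optimization, each soft block is discretized with argmax to obtain the
# final categorical reconstruction.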
Data Leakage in Federated Averaging
Recent attacks have shown that user data can be recovered from FedSGD
updates, thus breaking privacy. However, these attacks are of limited practical
relevance as federated learning typically uses the FedAvg algorithm. Compared
to FedSGD, recovering data from FedAvg updates is much harder as: (i) the
updates are computed at unobserved intermediate network weights, (ii) a large
number of batches are used, and (iii) labels and network weights vary
simultaneously across client steps. In this work, we propose a new
optimization-based attack which successfully attacks FedAvg by addressing the
above challenges. First, we solve the optimization problem using automatic
differentiation: we simulate the client's update for the recovered labels and
inputs, generating the unobserved intermediate parameters, and force the
simulated update to match the received client update. Second, we address the large number of batches by
relating images from different epochs with a permutation invariant prior.
Third, we recover the labels by estimating the parameters of existing FedSGD
attacks at every FedAvg step. On the popular FEMNIST dataset, we demonstrate
that on average we successfully recover >45% of the client's images from
realistic FedAvg updates computed on 10 local epochs of 10 batches each with 5
images, compared to only <10% using the baseline. Our findings show many
real-world federated learning implementations based on FedAvg are vulnerable.
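The simulation idea can be illustrated with a short sketch, not the paper's code: candidate inputs and labels are free variables, the client's local SGD steps are replayed on them with automatic differentiation, and the distance between the simulated and the observed FedAvg update is minimized. The use of torch.func.functional_call for the replay and the single batch per local step are simplifying assumptions.

# Illustrative sketch of attacking FedAvg by simulating the client update.
import torch
from torch.func import functional_call

def simulate_client(model, start_params, x_cand, y_cand, lr, local_steps):
    params = {k: v for k, v in start_params.items()}
    for _ in range(local_steps):
        out = functional_call(model, params, (x_cand,))
        loss = torch.nn.functional.cross_entropy(out, y_cand)
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        params = {k: v - lr * g for (k, v), g in zip(params.items(), grads)}
    return params   # unobserved intermediate weights end here

def reconstruction_loss(model, start_params, observed_update, x_cand, y_cand, lr, steps):
    end_params = simulate_client(model, start_params, x_cand, y_cand, lr, steps)
    return sum(((end_params[k] - start_params[k]) - observed_update[k]).pow(2).sum()
               for k in start_params)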
Learning Certified Individually Fair Representations
Fair representation learning provides an effective way of enforcing fairness
constraints without compromising utility for downstream users. A desirable
family of such fairness constraints, each requiring similar treatment for
similar individuals, is known as individual fairness. In this work, we
introduce the first method that enables data consumers to obtain certificates
of individual fairness for existing and new data points. The key idea is to map
similar individuals to close latent representations and leverage this latent
proximity to certify individual fairness. That is, our method enables the data
producer to learn and certify a representation where for a data point all
similar individuals are at ℓ∞-distance at most ε, thus
allowing data consumers to certify individual fairness by proving
ε-robustness of their classifier. Our experimental evaluation on five
real-world datasets and several fairness constraints demonstrates the
expressivity and scalability of our approach.
Comment: Conference Paper at NeurIPS 202
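The composition behind the certificate amounts to a one-line check. The function below is only a schematic restatement of that reasoning, with hypothetical argument names.

# Schematic restatement of the certificate: if the producer proves that every
# individual similar to x is encoded within l_inf distance eps of z = f(x), and
# the consumer proves their classifier at z is robust to any l_inf perturbation
# of size eps, then all similar individuals receive the same prediction.
def individually_fair_at(encoder_similarity_radius, classifier_robust_radius, eps):
    return encoder_similarity_radius <= eps and eps <= classifier_robust_radius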
Efficient Certification of Spatial Robustness
Recent work has exposed the vulnerability of computer vision models to vector
field attacks. Due to the widespread usage of such models in safety-critical
applications, it is crucial to quantify their robustness against such spatial
transformations. However, existing work only provides empirical robustness
quantification against vector field deformations via adversarial attacks, which
lack provable guarantees. In this work, we propose novel convex relaxations,
enabling us, for the first time, to provide a certificate of robustness against
vector field transformations. Our relaxations are model-agnostic and can be
leveraged by a wide range of neural network verifiers. Experiments on various
network architectures and different datasets demonstrate the effectiveness and
scalability of our method.
Comment: Conference Paper at AAAI 202
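A coarse version of the bounding step can be sketched directly: if each pixel may be displaced by at most delta in every coordinate, a bilinearly interpolated pixel value stays within the minimum and maximum of the image over the reachable neighborhood. The paper's relaxations are considerably tighter; this sketch only conveys the soundness argument.

# Coarse interval bound on one interpolated pixel under a vector-field
# displacement of magnitude at most delta per coordinate. Any bilinear
# interpolation at a displaced location is a convex combination of pixels in
# the covered grid window, so its value lies in [patch.min(), patch.max()].
import numpy as np

def pixel_interval(img, i, j, delta):
    h, w = img.shape
    i0, i1 = max(0, int(np.floor(i - delta))), min(h - 1, int(np.ceil(i + delta)))
    j0, j1 = max(0, int(np.floor(j - delta))), min(w - 1, int(np.ceil(j + delta)))
    patch = img[i0:i1 + 1, j0:j1 + 1]
    return float(patch.min()), float(patch.max())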
Robustness Certification for Point Cloud Models
The use of deep 3D point cloud models in safety-critical applications, such
as autonomous driving, dictates the need to certify the robustness of these
models to real-world transformations. This is technically challenging, as it
requires a scalable verifier tailored to point cloud models that handles a wide
range of semantic 3D transformations. In this work, we address this challenge
and introduce 3DCertify, the first verifier able to certify the robustness of
point cloud models. 3DCertify is based on two key insights: (i) a generic
relaxation based on first-order Taylor approximations, applicable to any
differentiable transformation, and (ii) a precise relaxation for global feature
pooling, which is more complex than pointwise activations (e.g., ReLU or
sigmoid) but commonly employed in point cloud models. We demonstrate the
effectiveness of 3DCertify by performing an extensive evaluation on a wide
range of 3D transformations (e.g., rotation, twisting) for both classification
and part segmentation tasks. For example, we can certify robustness against
rotations by 60° for 95.7% of point clouds, and our max pool
relaxation increases certification by up to 15.6%.
Comment: International Conference on Computer Vision (ICCV) 202
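The first insight can be made concrete on a single coordinate: for a rotation about the z-axis by an angle in a given interval, the rotated x-coordinate is sandwiched between its first-order Taylor expansion at the interval midpoint plus and minus a Lagrange remainder. The snippet is an illustrative special case, not the verifier itself.

# First-order Taylor relaxation of f(t) = x*cos(t) - y*sin(t) for t in
# [t_lo, t_hi]: the linear part is evaluated at the interval endpoints and the
# remainder uses |f''(t)| <= sqrt(x^2 + y^2).
import numpy as np

def rotated_x_bounds(x, y, t_lo, t_hi):
    t0 = (t_lo + t_hi) / 2
    f0 = x * np.cos(t0) - y * np.sin(t0)                 # value at midpoint
    df = -x * np.sin(t0) - y * np.cos(t0)                # first derivative
    half_width = (t_hi - t_lo) / 2
    remainder = 0.5 * np.hypot(x, y) * half_width ** 2   # Lagrange remainder bound
    lin = lambda t: f0 + df * (t - t0)
    lo = min(lin(t_lo), lin(t_hi)) - remainder
    hi = max(lin(t_lo), lin(t_hi)) + remainder
    return lo, hi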
Latent Space Smoothing for Individually Fair Representations
Fair representation learning encodes user data to ensure fairness and
utility, regardless of the downstream application. However, learning
individually fair representations, i.e., guaranteeing that similar individuals
are treated similarly, remains challenging in high-dimensional settings such as
computer vision. In this work, we introduce LASSI, the first representation
learning method for certifying individual fairness of high-dimensional data.
Our key insight is to leverage recent advances in generative modeling to
capture the set of similar individuals in the generative latent space. This
allows learning individually fair representations where similar individuals are
mapped close together, by using adversarial training to minimize the distance
between their representations. Finally, we employ randomized smoothing to
provably map similar individuals close together, in turn ensuring that local
robustness verification of the downstream application results in end-to-end
fairness certification. Our experimental evaluation on challenging real-world
image data demonstrates that our method increases certified individual fairness
by up to 60%, without significantly affecting task utility.
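One way to picture the training signal is the following sketch: a similar individual is generated by shifting a sample's latent code along a generative attribute direction, and the encoder is penalized for mapping the two far apart. The names (encoder, decoder, attr_direction) are placeholders, and the random shift stands in for the paper's adversarial search and for the subsequent randomized smoothing step.

# Hypothetical sketch of the similarity loss: decode a sample and a latent
# neighbor obtained by shifting along a generative attribute direction, then
# penalize the distance between their representations.
import torch

def similarity_loss(encoder, decoder, z, attr_direction, max_shift=1.0):
    shift = (torch.rand(z.size(0), 1) * 2 - 1) * max_shift   # random similar individual
    x = decoder(z)
    x_similar = decoder(z + shift * attr_direction)
    return (encoder(x) - encoder(x_similar)).pow(2).sum(dim=-1).mean()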